ABSTRACT



One embodiment of a speech recognition system is orga- 
nized with speech input signal preprocessing and feature 
extraction followed by a fuzzy matrix quantizer (FMQ) 
designed with respective codebook sets at multiple signal to 
noise ratios. The FMQ quantizes various training words
from a set of vocabulary words and produces observation
sequence O output data to train hidden Markov model
(HMM) processes λj, and produces fuzzy distance measure
output data for each vocabulary word codebook. A fuzzy
Viterbi algorithm is used by a processor to compute maximum
likelihood probabilities Pr(O|λj) for each vocabulary
word. The fuzzy distance measures and maximum likelihood
probabilities are mixed in a variety of ways to preferably
optimize speech recognition accuracy and speech
recognition speed performance.

35 Claims, 3 Drawing Sheets 
ADAPTIVE SPEECH RECOGNITION WITH
SELECTIVE INPUT DATA TO A SPEECH 
CLASSIFIER 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

This invention relates to speech recognition systems, and
more particularly to speech recognition systems providing
selective input data to a speech classifier, such as a neural
network, to balance speech recognition accuracy and speech
recognition system performance.

2. Description of the Related Art 

Speech is perhaps the most important communication 
method available to mankind. It is also a natural method for 
man-machine communication. Man-machine communica- 
tion by voice offers a whole new range of information/ 
communication services which can extend man's 
capabilities, serve his social needs, and increase his produc- 
tivity. Speech recognition is a key element in establishing 
man-machine communication by voice, and, as such, speech 
recognition is an important technology with tremendous 
potential for widespread use in the future. 

Voice communication between man and machine benefits 
from an efficient speech recognition interface. Speech rec- 
ognition interfaces are commonly implemented as Speaker- 
Dependent (SD)/Speaker-Independent (SI) Isolated Word 
Speech Recognition (IWSR)/continuous speech recognition 
(CSR) systems. The SD/SI IWSR/CSR system provides, for 
example, a beneficial voice command interface for hands 
free telephone dialing and interaction with voice store and 
forwarding systems. Such technology is particularly useful 
in an automotive environment for safety purposes. 

However, to be useful, speech recognition must generally 
be very accurate in correctly recognizing (classifying) the 
speech input signal with a satisfactory probability of accu- 
racy. Difficulty in correct recognition arises particularly 
when operating in an acoustically noisy environment. Rec- 
ognition accuracy may be severely and unfavorably 
impacted under realistic environmental conditions where 
speech is corrupted by various levels of acoustic noise. 

FIG. 1 generally characterizes a speech recognition pro- 
cess by the speech recognition system 100. A microphone 
transducer 102 picks up a speech input signal and provides 
to signal preprocessor 104 an electronic signal representa- 
tion of the speech input signal 101. The speech input signal 
101 is an acoustic waveform of a spoken input, typically a 
word, or a connecting string of words. The signal prepro- 
cessor 104 may, for example, filter the speech input signal 
101, and a feature extractor 106 extracts selected informa- 
tion from the speech input signal 101 to characterize the 
signal with, for example, cepstral frequencies or line spectral 
pair frequencies (LSPs). 

Referring to FIG. 2, more specifically, feature extraction 
in operation 106 is basically a data-reduction technique 
whereby a large number of data points (in this case samples 
of the speech input signal 101 recorded at an appropriate 
sampling rate) are transformed into a smaller set of features 
which are "equivalent", in the sense that they faithfully 
describe the salient properties of the speech input signal 101. 
Feature extraction is generally based on a speech production 
model which typically assumes that the vocal tract of a 
speaker can be represented as the concatenation of lossless 
acoustic tubes (not shown) which, when excited by excita- 
tion signals, produces a speech signal. Samples of the speech 
waveform are assumed to be the output of a time-varying 
filter that approximates the transmission properties of the
vocal tract. It is reasonable to assume that the filter has fixed
characteristics over a time interval of the order of 10 to 30
milliseconds (ms). Thus, a short-time speech input signal
portion of speech input signal 101 may be represented by a
linear, time-invariant all pole filter designed to model the
spectral envelope of the signal in each time frame. The filter
may be characterized within a given interval by an impulse
response and a set of coefficients.

Feature extraction in operation 106 using linear predictive
(LP) speech production models has become the predominant
technique for estimating basic speech parameters such as
pitch, formants, spectra, and vocal tract area functions. The
LP model allows for linear predictive analysis which basically
approximates a speech input signal 101 as a linear
combination of past speech samples. By minimizing the sum
of the squared differences (over a finite interval) between
actual speech samples and the linearly predicted ones, a
unique set of prediction filter coefficients can be determined.
The predictor coefficients are weighting coefficients used in
the linear combination of past speech samples. The LP
coefficients are generally updated very slowly with time, for
example, every 10-30 ms, to represent the changing vocal
tract. LP prediction coefficients are calculated using a variety
of well-known procedures, such as autocorrelation and
covariance procedures, to minimize the difference between
the actual speech input signal 101 and a predicted speech
input signal 101 often stored as a spectral envelope reference
pattern. The LP prediction coefficients can be easily transformed
into several different representations including cepstral
coefficients and line spectrum pair (LSP) frequencies.
Details of LSP theory can be found in N. Sugamura, "Speech
Analysis and Synthesis Methods Developed at ECL in
NTT-from LPC to LSP", Speech Communication 5, Elsevier
Science Publishers, B. V., pp. 199-215 (1986).
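The autocorrelation procedure mentioned above can be sketched in a few lines. The following is a minimal illustration only, assuming the inverse-filter convention A(z) = 1 + a_1 z^-1 + . . . + a_P z^-P; the function names `autocorrelation` and `levinson_durbin` are hypothetical and are not taken from this description, which does not prescribe a particular solver.

```python
def autocorrelation(frame, max_lag):
    """Return r[0..max_lag], the short-time autocorrelation of one frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations by the Levinson-Durbin recursion.

    Returns (a, err) where a[0] == 1.0 and a[1..order] are the inverse
    filter coefficients, and err is the residual prediction error energy.
    With this sign convention, sample s[t] is predicted as
    -sum(a[k] * s[t - k] for k in 1..order).
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        # Correlation of the current predictor's error with lag i.
        acc = r[i] + sum(a[k] * r[i - k] for k in range(1, i))
        k_i = -acc / err                      # reflection coefficient
        prev = a[:]
        for k in range(1, i):
            a[k] = prev[k] + k_i * prev[i - k]
        a[i] = k_i
        err *= (1.0 - k_i * k_i)
    return a, err
```

For an autocorrelation sequence of the form r[k] = 0.5^k, the recursion returns a first coefficient of -0.5 and a second coefficient of zero, i.e. each sample is predicted as half of the previous one.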

Final decision-logic classifier 108 utilizes the extracted
information to classify the represented speech input signal to
a database of representative speech input signals. Speech
recognition classifying problems can be treated as a classical
pattern recognition problem. Fundamental ideas from signal
processing, information theory, and computer science can be
utilized to facilitate isolated word recognition and simple
connected-word sequences recognition.

FIG. 2 illustrates a more specific speech recognition
system 200 based on pattern recognition as used in many
IWSR type systems. The extracted features representing
speech input signal 101 are segmented into short-term
speech input signal frames and considered to be stationary
within each frame for 10 to 30 msec duration. The extracted
features may be represented by a P-dimensional vector and
compared with predetermined, stored reference patterns 208
by the pattern similarity operation 210. Similarity between
the speech input signal 101 pattern and the stored reference
patterns 208 is determined in pattern similarity operation
210 using well-known vector quantization processes. The
vector quantization process yields spectral distortion or
distance measures to quantify the score of fitness or closeness
between the representation of speech input signal 101
and each of the stored reference patterns 208.

Several types of spectral distance measures have been
studied in conjunction with speech recognition including
LSP based distance measures such as the LSP Euclidean
distance measure (dLSP) and weighted LSP Euclidean distance
measure (dWLSP). They are defined by

$$d_{LSP}(f_R, f_S) = \sqrt{\sum_{i=1}^{P} [f_R(i) - f_S(i)]^2}$$

and

$$d_{WLSP}(f_R, f_S) = \sqrt{\sum_{i=1}^{P} w(i)\,[f_R(i) - f_S(i)]^2}$$




6,044,343 



where f_R(i) and f_S(i) are the ith LSPs of the reference and
speech vectors, respectively. The factor "w(i)" is the weight
assigned to the ith LSP and P is the order of LPC filter. The
weight factor w(i) is defined as:

$$w(i) = [P(f_S(i))]^r$$

where P(f) is the LPC power spectrum associated with the
speech vector as a function of frequency, f, and r is an
empirical constant which controls the relative weights given
to different LSPs. In the weighted Euclidean distance
measure, the weight assigned to a given LSP is proportional
to the value of LPC power spectrum at this LSP frequency.
The decision rule operation 212 receives the distance
measures and determines which of the reference patterns
208 the speech input signal 101 most closely represents. In
a "hard" decision making process, speech input signal 101
is matched to one of the reference patterns 208. This
one-to-one "hard decision" ignores the relationship of the
speech input signal 101 to all the other reference patterns
208. Fuzzy methods have been introduced to provide a better
match between vector quantized frames of speech input
signal 101 and reference patterns 208. In a "soft" or "fuzzy"
decision making process, speech input signal 101 is related
to one or more reference patterns 208 by weighting coefficients.

Matrix quantization has also been used to introduce
temporal information about speech input signal 101 into
decision rule operation 212. Fuzzy analysis methods have
also been incorporated into matrix quantization processes, as
described in Xydeas and Cong, "Robust Speech Recognition
In a Car Environment", Proceeding of the DSP95 International
Conference on Digital Signal Processing, Jun. 26-28,
1995, Limassol, Cyprus. Fuzzy matrix quantization allows
for "soft" decision using interframe information related to
the "evolution" of the short-term spectral envelopes of
speech input signal 101.

However, speech recognition technology still does not
have a perfect recognition accuracy, and recognition accuracy
particularly declines as acoustic signal to noise ratios
(SNR) decrease. Also, speech recognition system performance
declines as more vocabulary words are targeted for
recognition. Accordingly, a need exists to improve speech
recognition accuracy. Additionally, a need exists to increase
the overall speed performance of speech recognition systems
while maintaining satisfactory speech recognition accuracy.

SUMMARY OF THE INVENTION 

In one embodiment, speech recognition system accuracy 
and performance may be balanced by, for example, provid- 
ing multiple sources of speech input signal information to a 
speech classifier of a higher processing level such as a neural 
network. Furthermore, in one embodiment, speech recogni- 
tion system speed performance may be selectively enhanced 
without substantial compromise in speech recognition accu- 
racy by selectively providing less speech input signal infor- 
mation to a speech classifier when, for example, a speech 
input signal is corrupted by high SNR levels, where the
increased recognition gains achieved by providing more
information to the speech classifier are offset by the speech 
recognition system processing speed penalty. Additionally, 
speech recognition system speed performance may be selec- 
tively enhanced without substantial compromise in speech 
recognition accuracy by selectively providing less speech 
input signal information to a speech classifier when speed 
performance is noticeably degraded by, for example, using 
a large number of vocabulary words that strain available 
computational resources. 

In one embodiment of the present invention, a speech 
recognition system includes a first speech signal preproces- 
sor to receive first input data representing a speech input 
signal and having first speech input signal preclassifying 
output data and a second speech signal preprocessor to 
receive second input data representing the speech input 
signal and having second speech input signal preclassifying 
output data. The speech recognition system further includes 
a mixer to receive the first and second speech input signal 
preclassifying output data and having output data repre- 
sented by a selected mix of the first and second speech input 
signal preclassifying output data and a speech classifier to 
receive the selected mix of the first and second word 
preclassifying output data and having output data to classify 
the speech input signal. 

In another embodiment of the present invention, a speech 
recognition method includes the steps of processing first 
speech input signal data to preclassify the speech input 
signal and produce first preclassification output data,
wherein the first speech input signal data represents a speech 
input signal, processing second speech input signal data to 
preclassify the speech input signal and produce second 
preclassification output data, and determining a preferred 
mix of the preclassification output data. The method further 
includes the steps of mixing the first and second preclassi- 
fication output data in accordance with the determined 
preferred mix and classifying the speech input signal based 
on the preferred mix of preclassification output data. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Features appearing in multiple figures with the same 
reference numeral are the same unless otherwise indicated. 

FIG. 1, labeled prior art, illustrates a general speech 
recognition system. 

FIG. 2 illustrates a pattern-recognition based speech recognition
system.

FIG. 3 illustrates an FMQ/HMM/NN speech recognition 
system embodiment with selective data input to the NN 
using a single codebook per vocabulary word per SNR level 
for training. 

FIG. 4 illustrates another FMQ/HMM/NN speech recog- 
nition embodiment with selective data input to the NN using 
a single codebook per vocabulary word. 

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENTS 

The following description of the invention is intended to
be illustrative only and not limiting.

This description uses the following abbreviations:
FMQ — Fuzzy Matrix Quantization
FVQ — Fuzzy Vector Quantization
MQ — Matrix Quantization
HMM — Hidden Markov Model
λ — a HMM process
Pr(O|λ) — Probability of model/process λ producing observation O
NN — Neural network
MLP — Multilayer Perceptron neural network
LSP — Line Spectral Pair
dB — decibel
FD — Fuzzy distance measure
IWSR — Isolated Word Speech Recognition
SNR — Signal to Noise Ratio

Referring to an embodiment of a speech recognition 
system in FIG. 3, IWSR speech recognition system 300 
combines the classification power of speech classifier neural 
network 302 with temporal distortion and probability infor- 
mation derived from frames of input speech signal 304 with 
speech preprocessors to classify speech signal 304 from a 
predetermined set of vocabulary words. Additionally, speech 
recognition system 300 includes noise level detection cir- 
cuitry and MLP neural network 302 input data selection 
control to dynamically yield satisfactory recognition accuracy
while tailoring recognition process speed performance
to a user environment. In preparation for IWSR, speech 
recognition system 300 undergoes a training process of 
designing FMQ 309 codebooks and robust FMQ 308 
codebooks, training u hidden Markov models in preclassifier 
HMMs 306, and training neural network 302. A data base of 
u words repeated r times and corrupted by s different levels 
of acoustic noise is used during the training process, where 
u corresponds to a number of vocabulary words of speech 
recognition system 300, and s and r are positive integers, for 
example, seven and thirty, respectively. 

Speech recognition system 300 is designed to classify an 
input speech signal 304 word as one of u predetermined 
vocabulary words. During training, the FMQ 307 is a front 
end to HMMs 306 and MLP neural network 302. Speech 
recognition system 300 uses an observation sequence O_n of
probability mass vectors from FMQ 307 to train the HMMs
306 and uses mixed input data which may have fuzzy
distance measures to train MLP neural network 302. Signal 
modeling based on HMMs 306 can be considered as a 
technique that extends conventional stationary spectral 
analysis principles to the analysis of the quantized time- 
varying speech input signal 304. The time-varying quantized 
properties of speech input signal 304 are used by HMMs 306 
and Viterbi algorithm 310 to describe speech signal 304 
probabilistically. 

Initially during training of speech recognition system 300, 
for each of u vocabulary words, an FMQ codebook in FMQ
307 is designed from a database of u times r (ur) words for
each of s SNR levels. FMQ 307 uses the us codebooks 309 
for training neural network 302 using the us database at each 
of the s SNR levels. Thus, a total training database has u 
times r times s (urs) entries. Each of the u times s times r 
(usr) words is input to speech recognition system 300 as 
speech input signal 304 and preprocessed by preprocess 
operation 312 which, for example, band limits speech signal 
304 to 3.6 kHz and samples speech signal 304 at 8 ksamples/ 
sec with a resolution of 16 bits per sample. During speech 
recognition, when continuous speech is produced, voice 
activity detector (VAD) 314 effectively defines end points of 
input words for IWSR. A P order linear predictive code 
(LPC) analysis is performed in LPC operation 316 on a 20 
msec frame of speech signal 304 with a 10 msec overlap 
between frames to compute the LPC coefficients for the 
speech signal 304 frame using, for example, the Burg 
algorithm. P may vary depending on trade offs between
desired resolution and processing speed and in this
embodiment, P is assumed to be in the range of ten to
sixteen. Frame times may vary and are, in general, chosen to
represent an approximately static vocal tract period in a
range of, for example, 10-30 msec. The training process
follows the path through path position 1, to LSP operation
317 where line spectral pair frequencies are derived in a
well-known manner from the respective LPC coefficients.
LSP_(SNR) operations 318, 320, 322, 324, 326, 328, and
330 indicate that line spectral pair frequencies (coefficients)
are generated by LSP operation 317 for each speech signal
304 frame for all seven SNR levels from the LPC coefficients.
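The framing described above (8 ksamples/sec, 20 msec frames, 10 msec overlap) amounts to 160-sample frames that advance by 80 samples. A minimal sketch follows; the function name `split_into_frames` and its parameter names are hypothetical, and trailing samples that do not fill a whole frame are simply dropped, which is an assumption of this example rather than something the description specifies.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=20, overlap_ms=10):
    """Split a sampled word into overlapping analysis frames.

    Defaults mirror the example figures in the text: 8 kHz sampling,
    20 ms frames, 10 ms overlap between consecutive frames.
    """
    frame_len = sample_rate_hz * frame_ms // 1000              # 160 samples
    hop = sample_rate_hz * (frame_ms - overlap_ms) // 1000     # 80 samples
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]
```

Each returned frame would then be passed to the LPC analysis and converted to LSP frequencies.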

In the embodiment of FIG. 3, the respective SNR levels
used to train speech recognition system 300 are clean speech
(∞), 35 dB, 25 dB, 24 dB, 18 dB, 12 dB, and 6 dB to model
various noises in an automotive environment. Other SNR
values may be chosen to model other speech environments
or more extensively model the automotive environment.
Speech recognition system 300 is designed for robustness by
training with multiple acoustic noise SNR corruption levels
to better model realistic speech signal 304 input conditions
where speech is corrupted by acoustic noise.

The LSP representations of speech signal 304 are used to
define a spectral envelope because they provide a robust
representation of the speech short-term magnitude spectral
envelope of speech signal 304. Band limited input distortion
affects only a subset of LSP coefficients, as compared to the
case of a cepstral representation where input noise corrupts
all the coefficients. Additionally, LSP parameters have both
well-behaved dynamic range and filter stability preservation
properties and can be coded more efficiently than other
parameters. As a result, the LSP representation can lead to
a 25-30% bit-rate reduction in coding the filter (vocal tract)
information, as compared to the cepstral coefficient
representation. Furthermore, spectral LSP sensitivities are
localized, i.e., a change in a given LSP produces a change in
the LP power spectrum only in its neighborhood frequencies.
For example, a change in an LSP from 1285 Hz to 1310
Hz affects the LP power spectrum near 1300 Hz. This is
particularly useful when speech is corrupted by narrow band
noise in which case only a subset of LSP parameters are
affected by the input noise.

In general given a short segment of speech signal 304 and
the corresponding all-pole filter H(z)=G/A(z), where A(z) is
the inverse filter given by

$$A(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + \dots + a_P z^{-P}$$

where P is the order of the predictor and {a_i} are the
prediction coefficients, the LSPs are defined by decomposing
the inverse filter polynomial into two polynomials,

$$P(z) = A(z) + z^{-(P+1)} A(z^{-1})$$

and

$$Q(z) = A(z) - z^{-(P+1)} A(z^{-1}),$$

where P(z) is a symmetric polynomial, Q(z) is an anti-symmetric
polynomial and

$$A(z) = \tfrac{1}{2}\,[P(z) + Q(z)].$$

The roots of the polynomials P(z) and Q(z) define the LSP
frequencies.
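The decomposition can be checked numerically. The sketch below builds the coefficient lists of the two polynomials from the prediction coefficients, assuming the standard LSP construction P(z) = A(z) + z^-(P+1) A(z^-1) and Q(z) = A(z) - z^-(P+1) A(z^-1); the function name `lsp_polynomials` is hypothetical, and root finding (which yields the LSP frequencies themselves) is deliberately left out.

```python
def lsp_polynomials(a):
    """Build the symmetric and anti-symmetric LSP polynomials.

    a = [1, a_1, ..., a_P] holds the inverse filter coefficients in
    increasing powers of z^-1.  Returns (p, q), each a list of P + 2
    coefficients in increasing powers of z^-1.
    """
    ext = list(a) + [0.0]      # A(z) padded to degree P + 1
    rev = ext[::-1]            # coefficients of z^-(P+1) * A(z^-1)
    p = [x + y for x, y in zip(ext, rev)]
    q = [x - y for x, y in zip(ext, rev)]
    return p, q
```

For a = [1.0, -0.5, 0.25] this gives p = [1.0, -0.25, -0.25, 1.0] and q = [1.0, -0.75, 0.75, -1.0]; p reads the same in both directions, q changes sign when reversed, and their average recovers the padded A(z).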

Each of the us FMQ 309 codebooks for a given vocabulary
word is designed by developing a matrix entry from a
corresponding speech signal 304 input word W_nkm, n=1, 2,
. . . u, k=1, 2, . . . , s, m=1, 2, . . . , r, from the database of
usr words. The r matrix entries for each of the u words at
each of the s SNR levels are used to design the us respective
FMQ 309 codebooks for a respective group of r matrix
entries. Each of the us groups is processed to optimally
cluster each of the r entries for each separate codebook into
C cells. A centroid is computed for each of the C cells for
minimum quantization distortion using, for example, a
Fuzzy C-algorithm or a fuzzy Linde-Buzo-Gray (LBG)
algorithm as illustratively discussed in chapter 3, section
3.3.4 of the Doctor of Philosophy thesis of Lin Cong entitled
"A Study of Robust IWSR Systems" and located in the John
Rylands University Library of Manchester in Manchester,
England, which thesis is hereby incorporated by reference in
its entirety, and further illustratively discussed in C. S.
Xydeas and Lin Cong, "Robust Speech Recognition Using
Fuzzy Matrix Quantisation, Neural Networks and Hidden
Markov Models", pp. 1587-1590, EUSIPCO-96, Vol. 1,
September, 1996, which is also incorporated by reference in
its entirety. Thus, us matrix codebooks (MCB_nk) in FMQ
307 are formed.

The u FMQ 308 codebooks are also designed by developing
a matrix entry for each input word W_nkm, n=1, 2, . . .
u, k=1, 2, . . . , s, m=1, 2, . . . , r, from the database of urs
words. The sr matrix entries for each of the u words are 
processed to optimally cluster each entry into C cells. A 
centroid for each of the C cells is computed for each of the 
u FMQ 308 separate codebooks for minimum quantization 
distortion using, for example, the fuzzy C-algorithm or the 
fuzzy Linde-Buzo-Gray (LBG) algorithm as discussed in 
chapter 3, section 3.3.4 of the Doctor of Philosophy thesis of 
Lin Cong entitled "A Study of Robust IWSR Systems". 

The us codebooks 309 and u codebooks 308 utilize 
interframe information related to the "evolution" of the 
speech short-term spectral envelopes of speech signal 304 
by operating on N consecutive speech frames of speech 
signal 304. The us codebooks 309 and u codebooks 308 are 
designed separately using the database of urs words. However,
the following representation and quantization of a word
generically represent the training of and quantization with
codebooks 309 and 308.

Each frame is represented by P LSP coefficients, and,
thus, an N frames speech input signal segment provides a
P×N matrix of LSP coefficients. Each matrix entry for a
speech signal 304 input word W_nkm may be designed using
a training set of TO speech spectral vectors for each of TO
frames of each speech signal 304 word W_nkm, which result
in a set X={x_1, x_2, . . . , x_T} of T, P×N matrices for each
speech signal 304 word W_nkm, where T=int(TO/N) and

$$x_k = \begin{bmatrix} x^k_{11} & x^k_{12} & \cdots & x^k_{1N} \\ x^k_{21} & x^k_{22} & \cdots & x^k_{2N} \\ \vdots & \vdots & & \vdots \\ x^k_{P1} & x^k_{P2} & \cdots & x^k_{PN} \end{bmatrix} = [x_k(1), x_k(2), \dots, x_k(N)],$$

where the x_k(j)=[x^k_1j x^k_2j . . . x^k_Pj]', j=1, 2, . . . , N, k=1, 2,
. . . , T for each word W_nkm is grouped by word and SNR
level to form the r entries in each of the corresponding us
codebooks 309. The x_k(j)=[x^k_1j x^k_2j . . . x^k_Pj]', j=1, 2, . . . , N,
k=1, 2, . . . , T for each word W_nkm is grouped by word to
form the rs entries for each of the corresponding u FMQ 308
codebooks. The x_k(j) for each word entry in a codebook is
processed using, for example, the LBG algorithm, to yield a
C-cell partitioning of the matrix space for each codebook
and V-matrix entries containing C v_i, i=1, 2, . . . , C, P×N
codeword matrices

$$v_i = \begin{bmatrix} v^i_{11} & v^i_{12} & \cdots & v^i_{1N} \\ v^i_{21} & v^i_{22} & \cdots & v^i_{2N} \\ \vdots & \vdots & & \vdots \\ v^i_{P1} & v^i_{P2} & \cdots & v^i_{PN} \end{bmatrix} = [v_i(1), v_i(2), \dots, v_i(N)],$$

where v_i(j)=[v^i_1j v^i_2j . . . v^i_Pj]', j=1, 2, . . . , N.

Continuing the training process of speech recognition
system 300, each of the training word entries in the urs
training word database are provided as a respective speech
signal 304 training word. Quantization of a word W_nkm
occurs in the same manner for each codebook of codebooks
308 and 309. Each training speech signal 304 is preprocessed
by preprocess operation 312, and LPC coefficients
are determined in LPC operation 316 as described above.
Each of the LPC coefficients are converted into respective
line spectral pair frequencies by LSP operation 317. Each of
the training words is represented by a respective set of
the TO speech spectral vectors for each frame of each speech
signal 304 word W_nkm, which result in a set
X={x_1, x_2, . . . , x_T} of T, P×N matrices for each speech signal
304 word W_nkm, where T=int(TO/N).

A non-fuzzy matrix quantization of X can be described by
a C×T classification matrix U of elements:

$$u_{ik} = \begin{cases} 1, & x_k \in A_i \\ 0, & x_k \notin A_i \end{cases} \qquad i = 1, 2, \dots, C; \quad k = 1, 2, \dots, T.$$

Furthermore, the elements of this MQ matrix satisfy the
following two conditions:

a) $$\sum_{i=1}^{C} u_{ik} = 1,$$

i.e., only one element in a column is equal to one; the
remaining elements are zero. This implies that each matrix
x_k is "quantized" to only one centroid of the matrix space.

b) $$\sum_{k=1}^{T} u_{ik} > 0;$$

this ensures that there is no empty cell in this C-class
partitioning of the matrix space.

The columns O_j, j=1, 2, . . . , T, of the classification
matrix U "map" effectively an input matrix x_j into a
vector O_j={u_1j, u_2j, . . . , u_Cj} with all zero values except one
element u_ij=1 indicating that the distance

$$d(x_j, v_i)$$

between x_j and the ith cell is minimized. Note that each of
the columns of relative closeness indices O_j, j=1, 2, . . . , T,
represents the input signal 304 at different instances in time.
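The grouping of frames into P×N matrices and the "hard" classification matrix can be sketched as follows. This is an illustration under stated assumptions: each matrix is stored as a list of N column vectors, the distance is a plain sum of squared differences over all entries (chosen only for the example, not the distance measure of this description), and the names `frames_to_matrices`, `matrix_distance` and `hard_classification_matrix` are hypothetical.

```python
def frames_to_matrices(lsp_frames, n_frames):
    """Group TO per-frame LSP vectors into T = int(TO / N) matrices.

    Each matrix is a list of N column vectors of P coefficients;
    leftover frames that do not fill a matrix are dropped.
    """
    t = len(lsp_frames) // n_frames
    return [lsp_frames[k * n_frames:(k + 1) * n_frames] for k in range(t)]


def matrix_distance(x, v):
    """Illustrative distance: squared differences over all P x N entries."""
    return sum((a - b) ** 2
               for x_col, v_col in zip(x, v)
               for a, b in zip(x_col, v_col))


def hard_classification_matrix(matrices, centroids):
    """Return the C x T matrix U with u[i][k] = 1 for the nearest centroid."""
    u = [[0] * len(matrices) for _ in centroids]
    for k, x in enumerate(matrices):
        nearest = min(range(len(centroids)),
                      key=lambda i: matrix_distance(x, centroids[i]))
        u[nearest][k] = 1
    return u
```

By construction every column of the returned matrix contains exactly one 1, which is condition a); condition b) (no empty cell) depends on the codebook design and is not enforced by this sketch.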
d(x_j, v_i) is the distance measure

[equation not legible in the source]

and, for example, the distance measure

[equation not legible in the source]

This distance measure is the distance between the jth column
vector x_j and v_i, which is the centroid of the ith cell. Note
that for a non-fuzzy MQ codebook, an optimum partition of
the matrix space of codebooks 308 and 309 into respective
C cells ensures that

$$J(U, V) = \sum_{j=1}^{T} \sum_{i=1}^{C} u_{ij}\, d(x_j, v_i)$$

is minimized. Different distance measures utilize different
quantization mechanisms for computing the "centroid"
matrices v_i.

The fuzzy matrix quantization of each of the training
words X for respective codebooks 308 and 309 is
described by a C×T fuzzy classification matrix U_F with
elements u_ik ∈ [0,1], i=1, 2, . . . , C, k=1, 2, . . . , T. The value
of u_ik, 0≤u_ik≤1, indicates the degree of fuzziness of the kth
input matrix x_k to the ith partitioning cell which is represented
by the centroid v_i. The two conditions are also
satisfied:

$$\sum_{i=1}^{C} u_{ik} = 1 \quad \text{and} \quad \sum_{k=1}^{T} u_{ik} > 0.$$

In this case, u_ik is derived as:

$$u_{ik} = \frac{1}{\displaystyle\sum_{j=1}^{C} \left( \frac{d(x_k, v_i)}{d(x_k, v_j)} \right)^{1/(F-1)}}$$
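A small sketch may help make the fuzzy membership concrete. It assumes the usual fuzzy C-means rule, in which the membership of an input matrix in cell i is 1 / Σ_j (d_i / d_j)^(1/(F-1)), with F > 1 controlling the degree of fuzziness; the function name `fuzzy_memberships`, the default F = 2 and the handling of exact zero distances are choices made for this example only.

```python
def fuzzy_memberships(distances, fuzziness=2.0):
    """Return memberships u_i for one input matrix x_k.

    distances[i] is d(x_k, v_i) for cells i = 0..C-1.  The result sums
    to one, so it can be read as a probability mass vector.
    """
    # An exact match would cause a division by zero; give the matching
    # cell(s) all of the mass instead (an assumption of this sketch).
    zeros = [i for i, d in enumerate(distances) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(distances))]
    exponent = 1.0 / (fuzziness - 1.0)
    return [1.0 / sum((d_i / d_j) ** exponent for d_j in distances)
            for d_i in distances]
```

With distances 1.0 and 3.0 and F = 2 the memberships are 0.75 and 0.25: the closer centroid receives the larger share, but the farther one is not ignored as it would be in a hard decision.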



Additionally, d(x_j, v_i) may be the robust distance measure
described in a concurrently filed U.S. patent application Ser. 
No. 08/883,980, filed Jun. 27, 1997, entitled "Robust Dis- 
tance Measure in a Speech Recognition System" by Safdar 
M. Asghar and Lin Cong, which is incorporated herein in its 
entirety. 

Furthermore, the overall distance of the C entries of a
fuzzy matrix quantizer codebook operating on the X matrix
set for a single word is

$$J(U, V) = \sum_{j=1}^{T} \sum_{i=1}^{C} u_{ij}^{F}\, d(x_j, v_i).$$

Note that the summation of O_j entries is equal to unity.
The largest component of O_j is the one which corresponds
to the codeword with the smallest d(x_j, v_i) value. O_j can be
interpreted as a probability mass vector relating the input
matrix x_j to all v_i, i=1, 2, . . . , C. The total observation
sequence O_n of probability mass vectors for each speech
signal 304 word for one codebook is defined as O_n={O_1, O_2,
. . . , O_T}.

A fuzzy distance measure FD_n^k, n=1, 2, . . . , u (words) and
k=1, 2, . . . , s (acoustic noise levels) between an input speech
signal 304 word and nth of the respective u codebooks at a
respective kth of the s SNR levels in FMQ 307 is formed as:

[equation not legible in the source]

A fuzzy distance measure between an input speech signal
304 and each of the respective u codebooks in FMQ 308
FD_n, n=1, 2, . . . , u, is formed as:

[equation not legible in the source]

where the constant F influences the degree of fuzziness.
d_ik(x_k, v_j) are the average distance measures as defined with
reference to the MQ design.

The columns of probability mass vectors O_j of the classification
matrix U_F "map" an input matrix x_j into a probability
mass vector of indices O_j={u_1j, u_2j, . . . , u_Cj} which
results in the distance

[equation not legible in the source]



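Equations

$$J(U, V) = \sum_{j=1}^{T} \sum_{i=1}^{C} u_{ij}\, d(x_j, v_i)$$

and

$$J(U, V) = \sum_{j=1}^{T} \sum_{i=1}^{C} u_{ij}^{F}\, d(x_j, v_i)$$

respectively provide the MQ and FMQ distance and can also
be represented by the general distance equation:

$$J(W, V) = \sum_{j=1}^{T} \sum_{i=1}^{C} w_{ij}\, d(x_j, v_i)$$

where

$$w_{ij} = \begin{cases} u_{ij}, & u_{ij} \in \{0, 1\} \ \text{(MQ)} \\ u_{ij}^{F}, & u_{ij} \in [0, 1] \ \text{(FMQ)} \end{cases}$$

When using LSP based distance measures, d(x_j, v_i) equals

[equation not legible in the source]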
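The general distance equation can be illustrated with a short sketch in which the same double sum serves both the MQ and FMQ cases and only the weights differ. The layout (C rows, T columns) and the names `overall_distance` and `fmq_weights` are assumptions of this example; the distances themselves are taken as given.

```python
def overall_distance(weights, distances):
    """J = sum over cells i and input matrices j of w[i][j] * d[i][j].

    Both arguments are C x T nested lists: row i is a codebook cell,
    column j is an input matrix x_j.
    """
    return sum(w * d
               for w_row, d_row in zip(weights, distances)
               for w, d in zip(w_row, d_row))


def fmq_weights(memberships, fuzziness):
    """FMQ weights: each fuzzy membership raised to the power F."""
    return [[u ** fuzziness for u in row] for row in memberships]
```

For MQ the weights are the 0/1 entries of the classification matrix itself, so `overall_distance(u, d)` adds up only the distances to the selected centroids; for FMQ the call becomes `overall_distance(fmq_weights(u_f, F), d)` and every centroid contributes in proportion to its weighted membership.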



Fuzzy matrix quantization is further illustratively discussed
in Xydeas and Cong, "Robust Speech Recognition in
a Car Environment," International Conf. on Digital Signal
Processing, Vol. 1, pp. 84-89, June, 1995, Cyprus, which is
herein incorporated by reference in its entirety.

During the training mode of speech recognition system
300, the training input data for the hidden Markov models of
classifier HMMs 306 are in one embodiment the observation
sequences O_n of probability mass vectors O_j from a
classification matrix U. Each classification matrix U is generated
by FMQ 308 u codebooks from a fuzzy matrix quantized
speech input signal for each of the training words W_nkm as
described above. HMMs 306 have a respective process λ_n,
n=1, 2, . . . , u, for each of the u words. The rs words for each
respective u vocabulary words are, in one embodiment,
fuzzy matrix quantized to train a corresponding HMM
process λ_n. The multiple arrows from FMQ 307 to HMMs
306 indicate that all SNR levels of the ur training words are
used to train each of the HMM processes λ_n. Each of the
observation sequences O_n from FMQ 308 for each of the urs
training words train corresponding HMM processes, λ_n,
n=1, 2, . . . , u, i.e. for an nth single word input signal, an
input observation sequence O_n to an HMM λ_n only comes
from one codebook n. Fuzzy Viterbi algorithm operation
310, described in section 4.3 of L. Cong, "A Study of Robust
IWSR Systems" utilizes a respective observation sequence
O_n from each of the rs versions of each of the u words and
a fuzzy Viterbi algorithm to produce a maximum likelihood
probability Pr(O_n|λ_n) of the HMM process λ_n producing the
observation sequence O_n. Separate HMMs may be built for
males and females, and the number of states of each HMM
is set to, for example, five. HMM training is further
described in chapter 2 and chapter 4, section 4.3 of L. Cong,
"A Study of Robust IWSR systems".

In one embodiment neural network 302 is a multilayer
perceptron type NN. Multilayer networks overcome many of

vocabulary word at the kth SNR level using the nth set of
FMQ 309 codebooks designed at the kth SNR level. For
example, if the nth vocabulary word is "ten" and is corrupted
by an SNR level of 06 dB, then the u codebooks designed
at the 06 dB SNR level are used to compute the u fuzzy
distance measures FD_n^k which may be selectively used as
respective input data to u nodes of the MLP neural network
302.

Referring to FIG. 4, in another embodiment of a speech
recognition system, speech recognition system 400 is identical
to speech recognition system 300 except that FMQ 407
replaces FMQ 307. FMQ 407 has u codebooks that are
identical to the u codebooks 308. Designing of the u
codebooks of FMQ 407 is identical to the designing of the
respective, corresponding u codebooks 308. Quantization
using the u codebooks of FMQ 407 is identical to quantization
using respective, corresponding u codebooks 308.

The MLP neural network 302 for speech recognition
system 400 is trained using a variety of selective mixes of
individual input data respectively generated using each of
the sr versions of the u vocabulary words in the training
database. When one of the urs input speech signal 304
training words is input to speech recognition system 300 as
speech signal 304, preprocess operation 312 preprocesses
speech signal 304 and LPC operation 316 determines the
prediction coefficients for each of the TO speech input signal
frames of speech signal 304. Each of the LPC coefficients for
each frame are respectively converted into respective line
spectral pair frequencies by LSP operation 317.

During training of MLP neural network 302, urs input
speech signal 304 training database words are used to train
MLP neural network 302. FMQ 407 is used to determine a
respective fuzzy distance measure FD_n between each of the
urs input speech signal 304 words and the respective u FMQ
407 codebooks. Thus, each of the u codebooks in FMQ 407

the limitations of single-layer networks. That is, 35 are used to determine a fuzzy distance measure FD n for each 

multilayered, hierarchical networks are more powerful version of each input signal 304 word at each SNR level, 

because of the nonlinearities and the internal representation The u fuzzy distance measures, {FD}„, one from each of the 

generated in the so-called hidden layers. The multiple nodes u FMQ 407 codebooks, for each of the urs input signal 304 

in the output layer typically correspond to multiple classes words may be respectively selected, in accordance with 

in the multi-class pattern recognition problem. In general, an 40 Table 2, to train MLP neural network 302. For example, if 

MLP neural network 302 has an ability to partition an input the nth vocabulary word is "ten" and is corrupted by an SNR 

pattern space in a classification problem and to represent level of 06 dB, then the u codebooks of FMQ 407 are used 

relationships between events. Additionally, MLP neural net- to compute the u respective fuzzy distance measures {FD}„ 

work 302 with multiple layers and sufficient interconnec- which may be selectively used as respective input data to u 

tions between nodes ensures an ability to "learn" complex 45 nodes of the MLP neural network 302. If the nth vocabulary 

classification boundaries, and implement nonlinear transfer- word is "ten" and is corrupted by an SNR level of 12 dB, 

mations for functional approximation problems. The MLP then the u codebooks of FMQ 407 are used to compute the 

neural network 302 has G hidden nodes where G is prefer- respective u fuzzy distance measures {FD}„ which may be 

ably determined empirically based upon the number of u selectively used as respective input data to u nodes of the 

vocabulary words, memory size, and processing capabilities. 50 MLP neural network 302, and so on. Thus, during training 

The MLP neural network 302 is trained using a variety of of MLP neural network 302, each of the u codebooks of 

selective mixes of individual input data respectively gener- FMQ 407 produces rs fuzzy distance measures for each of 

ated using each of the sr versions of the u vocabulary words the u vocabulary words. 

in the training database. When one of the urs training words Referring to FIGS. 3 and 4, during training of MLP neural 
is input to speech recognition system 300 as speech signal 55 network 302, if maximum likelihood probability input data 
304, preprocess operation 312 preprocesses speech signal derived from the u HMM processes X„ of HMMs 306 is 
304 and LPC operation 316 determines the prediction coef- selected, each of the u HMM processes \ n receive an 
ficients for each of the TO speech input signal frames of observation sequence O n from FMQ 308 (FMQ 407, FIG. 
speech signal 304. Each of the LPC coefficients for each 4). The u maximum likelihood probabilities {PROB} gen- 
frame are respectively converted into respective line spectral 60 erated by fuzzy Viterbi algorithm 310, as described above, 
pair frequencies by LSP operation 317. from each of the u HMM processes X n are used as input data 

Referring to FIG. 3, during training of MLP neural to u nodes of the MLP neural network 302. 

network 302 for speech recognition system 300, if fuzzy MLP neural network 302 provides u output signals, 

distance measure input data is selected, u fuzzy distance OUT(l), OUT(2), . . . , OUT(u), which assume values in the 

measures {FD}* input data is used. The us FMQ 309 65 region 0iOUT(n)€l, n-1, 2, . . . , u. The maxOUT(n) 

codebooks are used to determine a respective fuzzy distance represents the classification of speech signal 304 as the nth 

measures FD„* between each of the r versions of an nth vocabulary word. 
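
The structure of the MLP neural network 302 described above (u input nodes for a single-source mix, G hidden nodes, and u outputs bounded between 0 and 1) can be illustrated with a minimal forward-pass sketch. The sigmoid activation, the random weights, and the names `mlp_forward` and `random_layer` are assumptions made for illustration; the text only states that the outputs lie in the region 0 to 1 and that G is chosen empirically.

```python
import math
import random


def sigmoid(x):
    """Logistic activation (an assumption); keeps every output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mlp_forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer of G nodes followed by an output layer of u nodes."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return [
        sigmoid(sum(w * h for w, h in zip(row, hidden)) + b)
        for row, b in zip(w_out, b_out)
    ]


def random_layer(n_out, n_in, rng):
    """Hypothetical untrained weights; training (back propagation) is not shown."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


u, G = 4, 6                      # illustrative vocabulary size and hidden node count
rng = random.Random(0)
w_h, b_h = random_layer(G, u, rng)
w_o, b_o = random_layer(u, G, rng)

# u input values, e.g. one fuzzy distance measure per codebook (made-up numbers)
outputs = mlp_forward([0.2, 0.9, 0.4, 0.7], w_h, b_h, w_o, b_o)
```

With a 2u or 3u mix, only the input width of the hidden layer changes; the number of outputs stays at u.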

Referring to FIG. 3, during training, mixer 336 provides
several different mixes of input data selected from FMQ 
308/HMMs 306 and FMQ 309 codebooks to MLP neural 
network 302 of speech recognition system 300. Seven 
illustrative mixes are defined in Table 1. 

TABLE 1

MIX     MLP neural network 302 Input Data
MIX1    {FD}^k
MIX2    {PROB}
MIX3    {COM}
MIX4    {FD, PROB}
MIX5    {FD, COM}
MIX6    {PROB, COM}
MIX7    {FD, PROB, COM}

MIX1 represents that the u fuzzy distance measures
{FD}^k for a given vocabulary word at a kth SNR level are
directly applied to u input nodes of MLP neural network
302. MIX2 represents that for a given vocabulary word all
of the u HMMs 306 Pr'(O_n|λ_n) maximum likelihood
probabilities are applied directly to the u MLP neural network 302
input nodes. MIX3 represents that a combination {COM} of
the u fuzzy distance measures {FD}^k and u maximum
likelihood probabilities {PROB} is applied to the u MLP
neural network 302 input nodes. Each entry of the
combination {COM} is defined by FD_n^k - α Pr'(O_n|λ_n) for n=1,
2, . . . , u, where α is a scaling constant. MIX4 applies each
entry of MIX1 and MIX2 to 2u respective MLP neural
network 302 input nodes. MIX5 applies each entry of MIX1
and MIX3 to 2u respective MLP neural network 302 input
nodes. MIX6 applies each entry of MIX2 and MIX3 to 2u
respective MLP neural network 302 input nodes. MIX7
applies each entry of MIX1, MIX2, and MIX3 to 3u
respective MLP neural network 302 input nodes.

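The seven mixes can be assembled mechanically from the two preclassifier outputs. The following is a minimal sketch, assuming the combination entry is the difference FD_n - α·Pr'(O_n|λ_n) as read from the text; the function name `build_mix`, the default value of `alpha`, and the sample numbers are hypothetical.

```python
def build_mix(mix, fd, prob, alpha=1.0):
    """Return the MLP input vector for one of MIX1..MIX7.

    fd    -- u fuzzy distance measures, one per codebook
    prob  -- u maximum likelihood probabilities, one per HMM process
    alpha -- scaling constant for the combination (value is an assumption)
    """
    if len(fd) != len(prob):
        raise ValueError("fd and prob must both have u entries")

    # {COM}: entry-wise combination of the two sources
    com = [d - alpha * p for d, p in zip(fd, prob)]
    parts = {"FD": fd, "PROB": prob, "COM": com}

    # Which sources each mix concatenates, following the mix table
    layout = {
        "MIX1": ["FD"],
        "MIX2": ["PROB"],
        "MIX3": ["COM"],
        "MIX4": ["FD", "PROB"],
        "MIX5": ["FD", "COM"],
        "MIX6": ["PROB", "COM"],
        "MIX7": ["FD", "PROB", "COM"],
    }

    vector = []
    for name in layout[mix]:
        vector.extend(parts[name])
    return vector


# Made-up values for a vocabulary of u = 3 words
fd = [1.0, 2.0, 3.0]
prob = [0.5, 0.25, 0.125]
mix3 = build_mix("MIX3", fd, prob, alpha=2.0)   # u entries
mix7 = build_mix("MIX7", fd, prob, alpha=2.0)   # 3u entries
```

The length of the returned vector (u, 2u, or 3u) is the number of MLP input nodes that the selected mix occupies.
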
Referring to FIG. 4, during training, mixer 336 provides
several different mixes of input data selected from FMQ
407/HMMs 306 and FMQ 407 codebooks to MLP neural
network 302 of speech recognition system 400. Seven
illustrative mixes are defined in Table 2.

TABLE 2

MIX     MLP neural network 302 Input Data
MIX1    {FD}_n
MIX2    {PROB}
MIX3    {COM}
MIX4    {FD_n, PROB}
MIX5    {FD_n, COM}
MIX6    {PROB, COM}
MIX7    {FD, PROB, COM}

MIX1 represents that the u fuzzy distance measures
{FD}_n for a given vocabulary word are directly applied to u
input nodes of MLP neural network 302. MIX2 represents
that for a given vocabulary word all of the u HMMs 306
Pr'(O_n|λ_n) maximum likelihood probabilities are applied
directly to the u MLP neural network 302 input nodes. MIX3
represents that a combination {COM} of the u fuzzy distance
measures {FD}_n and u maximum likelihood probabilities
{PROB} is applied to the u MLP neural network 302
input nodes. Each entry of the combination {COM} is
defined by FD_n - α Pr'(O_n|λ_n) for n=1, 2, . . . , u, where α is
a scaling constant. MIX4 applies each entry of MIX1 and
MIX2 to 2u respective MLP neural network 302 input
nodes. MIX5 applies each entry of MIX1 and MIX3 to 2u
respective MLP neural network 302 input nodes. MIX6
applies each entry of MIX2 and MIX3 to 2u respective
MLP neural network 302 input nodes. MIX7 applies each
entry of MIX1, MIX2, and MIX3 to 3u respective MLP
neural network 302 input nodes.

Referring to FIGS. 3 and 4, the speech classifier MLP
neural network 302 accepts mixed input data and is
appropriately designed using the well-known back propagation
algorithm. The MLP neural network 302 is trained for the
nth vocabulary word, using the back propagation algorithm,
with the s SNR values of each of the r single word versions.

After training the speech recognition system 300, path 2
is selected to initiate a speech signal 304 recognition
process. When any speech signal 304 word W_n is spoken by a
user, VAD 314 effectively defines end points of input words
for IWSR. Speech input signal 304 word W_n is next
preprocessed by preprocess operation 312 as described above.
Word W_n is sampled at, for example, 8 ksamples/sec, and
segmented into TO frames, each frame t seconds long, such as 20
msec with a 10 msec overlap of each consecutive frame, of
W_n. LPC operation 316 generates P LPC coefficients for
each frame of W_n, and LSP operation 332 generates Pth
order LSP coefficients from the LPC coefficients as
described above.

FMQ 308 utilizes interframe information related to the
"evolution" of the speech short-term spectral envelopes of
speech signal 304 word W_n by operating on N consecutive
speech frames of word W_n. Each frame is represented
by P order LSP coefficients, and each of the T groups of N
frames of speech signal 304 word W_n is represented by a PxN
matrix of LSP coefficients, where T=int(TO/N). Word W_n may,
thus, be represented as a matrix X_Wn = {x_1, x_2, . . . , x_T} of T,
PxN matrices for each speech signal 304 word W_n, where
each of the T, PxN matrices is defined as:

x_k = [x_k(1), x_k(2), . . . , x_k(N)]

where x_k(j) = [x_1j^k x_2j^k . . . x_Pj^k]', j=1, 2, . . . , N,
k=1, 2, . . . , T.

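The framing and grouping arithmetic described above can be sketched as follows. This is an illustration under stated assumptions: the function names `count_frames` and `group_frames` are hypothetical, LPC and LSP analysis is not shown, and the frame count formula assumes that only complete frames are kept. The example figures (8 ksamples/sec, 20 msec frames, 10 msec overlap) and T=int(TO/N) come from the text.

```python
def count_frames(n_samples, sample_rate=8000, frame_ms=20, overlap_ms=10):
    """Number of complete frames TO for a word of n_samples samples."""
    frame_len = sample_rate * frame_ms // 1000            # 160 samples at 8 kHz
    hop = sample_rate * (frame_ms - overlap_ms) // 1000   # 80 samples at 8 kHz
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // hop


def group_frames(lsp_frames, N):
    """Group TO frames of P LSP coefficients into T = int(TO/N) PxN matrices.

    Column j of matrix k holds the P coefficients of the jth frame in
    group k. Trailing frames that do not fill a group are dropped.
    """
    if N <= 0:
        raise ValueError("N must be positive")
    T = len(lsp_frames) // N
    P = len(lsp_frames[0]) if lsp_frames else 0
    matrices = []
    for k in range(T):
        block = lsp_frames[k * N:(k + 1) * N]
        matrices.append([[block[j][p] for j in range(N)] for p in range(P)])
    return matrices


# One second of speech at 8 ksamples/sec
TO = count_frames(8000)

# Made-up "LSP" vectors: 7 frames, P = 3 coefficients each, N = 2 frames per matrix
frames = [[i * 10 + p for p in range(3)] for i in range(7)]
matrices = group_frames(frames, 2)
```

Each element of `matrices` is a list of P rows with N columns, matching the PxN layout of x_k.
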
FMQ 308 fuzzy matrix quantizes the matrix representation
X_Wn = x_k(j) of word W_n with the designed C codebook
entries for each of the u codebooks. FMQ 308 produces the
distance measure FD_n for each of u fuzzy matrix codebooks
in FMQ 308 with the smallest distance measure FD_n
indicating which of the u codebooks is closest to W_n. FMQ 308
also yields an observation sequence O_n of T probability mass
vectors O_i for each of the u codebooks as discussed above.
Observation sequence O_n is used as input data by a fuzzy
Viterbi algorithm 310 operating on each of the HMM λ_n
processes of HMMs 306. The u outputs of the fuzzy Viterbi
algorithm 310 are the maximum likelihood probability
Pr'(O_n|λ_n) measures that correspond to W_n.

The different MIXes in Tables 1 and 2 provide incremental
increases in speech recognition accuracy as well as
increases in processing time for a given speech recognition
system 300 processor. The speech recognition accuracy of
all MIXes is nominally the same at high SNRs.
Speech recognition system 300 adjusts the input data mix to
MLP neural network 302 in accordance with various
performance and recognition accuracy affecting factors. One
such factor is acoustic noise SNRs. Relatively high acoustic
noise levels tend to decrease the recognition accuracy of
speech recognition system 300 for a given MIX.
Accordingly, increasing the amount of data increases the
recognition accuracy of speech recognition system 300 but
decreases the speed performance of speech recognition
system 300. Another such factor is the size u of the
vocabulary of speech recognition system 300. As u is increased, less
input data may be used to increase the speed performance of
speech recognition system 300 when speed performance
becomes an issue for a user. Selection control circuit 338
preferably optimally balances speech recognition accuracy
with desirable processing speed.

Sensor 334 detects an acoustic noise level in an operating
environment of speech recognition system 300 and provides
a corresponding input signal to selection control circuit 338.
Selection control circuit 338 utilizes the noise level information
to select the MIX in Tables 1 and 2 that will yield a
predetermined speech recognition accuracy in the least
amount of time.

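The selection rule, picking the mix that reaches a predetermined accuracy in the least time, can be sketched as a table lookup. Everything numeric here is hypothetical: the text gives no accuracy figures or timings, and the names `select_mix`, `ACCURACY`, and `COST` are assumptions. The sketch also assumes the noise level has already been converted to an SNR estimate in dB.

```python
def select_mix(snr_db, target_accuracy, accuracy_table, cost):
    """Pick the cheapest mix meeting target_accuracy at the nearest known SNR.

    accuracy_table -- {snr_db: {mix_name: expected accuracy}} (hypothetical data)
    cost           -- {mix_name: relative processing time} (hypothetical data)
    """
    # Use the tabulated SNR level closest to the measured one
    nearest = min(accuracy_table, key=lambda level: abs(level - snr_db))
    row = accuracy_table[nearest]

    candidates = [mix for mix, acc in row.items() if acc >= target_accuracy]
    if not candidates:
        # Nothing reaches the target: fall back to the most accurate mix
        return max(row, key=row.get)
    return min(candidates, key=lambda mix: cost[mix])


# Illustrative numbers only; not taken from the patent
ACCURACY = {
    5:  {"MIX1": 0.80, "MIX4": 0.90, "MIX7": 0.95},
    20: {"MIX1": 0.97, "MIX4": 0.98, "MIX7": 0.98},
}
COST = {"MIX1": 1.0, "MIX4": 2.0, "MIX7": 3.0}

quiet_choice = select_mix(18, 0.95, ACCURACY, COST)
noisy_choice = select_mix(6, 0.90, ACCURACY, COST)
```

At high SNR all mixes meet the target, so the cheapest one is chosen; at low SNR a larger mix is needed, which reflects the accuracy versus speed trade-off described above.
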
After making the proper determination, selection control
circuit 338 provides the appropriate MIX to MLP neural
network 302. MLP neural network 302 then provides u
output signals, OUT(1), OUT(2), . . . , OUT(u), which
assume values in the region 0 ≤ OUT(n) ≤ 1. Decision logic
340 classifies W_n as the nth vocabulary word if OUT(n) =
max{OUT(1), OUT(2), . . . , OUT(u)}.

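The rule applied by decision logic 340 is an argmax over the u classifier outputs. A minimal sketch, with the function name `classify` and the sample vocabulary chosen purely for illustration:

```python
def classify(outputs, vocabulary):
    """Return the vocabulary word whose output OUT(n) is the maximum."""
    if len(outputs) != len(vocabulary):
        raise ValueError("one output per vocabulary word is expected")
    n = max(range(len(outputs)), key=lambda i: outputs[i])
    return vocabulary[n]


# Made-up outputs for a three-word vocabulary
word = classify([0.1, 0.7, 0.2], ["one", "two", "ten"])
```

If two outputs tie for the maximum, this sketch returns the first of them; the text does not specify a tie-breaking rule.
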
The ability to selectively control the input mixes to a 
speech classifier offers flexibility of a speech recognition 
system, such as speech recognition systems 300 and 400, to 
adapt to varying environmental conditions and system plat- 
form constraints, where the system platform may include 
one or more processors executing code in memory to 
implement speech recognition system 300 and speech rec- 
ognition system 400, respectively. For example, in a car 
environment, noise levels change at different traveling 
speeds with general predictability. The selection control 
circuit 338 may in one embodiment receive car speed input 
data and access a database of information relating to noise 
levels at various traveling speeds. The noise level informa- 
tion corresponding to the car speed may be retrieved and 
utilized by selection control circuit 338 to select the mix
from Table 1 that provides a satisfactory recognition rate in 
preferably the least amount of time. Additionally, if an 
unsatisfactory performance speed is detected by, for 
example, speech recognition systems 300 or 400, 
respectively, the MIX in Table 1 or Table 2, respectively,
may be selected to raise performance speed to a predeter- 
mined satisfactory level. An unsatisfactory performance 
speed may arise when, for example, the size of the
vocabulary u and the selected MIX require computational
resources that are at least temporarily unavailable.

It will be recognized that a variety of other factors in
addition to vocabulary size and dynamically detected SNR
levels may affect recognition accuracy. Accordingly,
selection control circuit 338 may be designed to select an
appropriate mix of input data to a speech classifier to
accommodate these other factors as well. Also, other speech
preprocessors, such as fuzzy and non-fuzzy vector
quantizers, and speech classifiers may be used in addition to
or in substitution of the speech preprocessors ((MQ, FMQ)/
HMM) discussed herein.

While the invention has been described with respect to the 
embodiments and variations set forth above, these embodi- 
ments and variations are illustrative and the invention is not 
to be considered limited in scope to these embodiments and 
variations. For example, other types of speech preprocessors
may be used to provide output data which may be appro- 
priately mixed, such as the single robust codebook/HMM 
preprocessor described in a concurrently filed U.S. patent
application Ser. No. 08/883,979, filed Jun. 27, 1997, entitled
"Speech Recognition System Using a Single Robust
Codebook" by Safdar M. Asghar and Lin Cong, which is
incorporated herein in its entirety. Also, it will be recognized that
additional combinations of data other than as listed in Table 
1 may be generated. Accordingly, various other embodi- 
ments and modifications and improvements not described 
herein may be within the spirit and scope of the present 
invention, as defined by the following claims. 
What is claimed is: 

1. A speech recognition system comprising: 

a first speech signal preprocessor to receive first input data 
representing a speech input signal and having first 
speech input signal preclassifying output data; 

a second speech signal preprocessor to receive second 
input data representing the speech input signal and 
having second speech input signal preclassifying out- 
put data; 

a mixer to receive the first and second speech input signal 
preclassifying output data and having output data rep- 
resented by a selected mix of the first and second 
speech input signal preclassifying output data; 

a selection control circuit coupled to the mixer to deter- 
mine the selected mix of the first and second speech 
input signal preclassifying output data by determining 
an appropriate balance between speech recognition 
accuracy of the speech recognition system and a speech 
recognition processing speed of the speech recognition 
system; and 

a speech classifier to receive the selected mix and having 
output data to classify the speech input signal as 
recognized speech. 

2. The speech recognition system of claim 1 wherein the 
selection control circuit is capable of dynamically selecting 
the selected mix based on predetermined parameters. 

3. The speech recognition system of claim 1 further 
comprising: 

a noise level detection sensor to provide a noise level 
parameter output signal to the selection control circuit. 

4. The speech recognition system of claim 1 wherein the 
first speech signal preprocessor comprises: 

a fuzzy matrix quantizer, wherein the first speech input 
signal preclassifying output data of the fuzzy matrix 
quantizer are fuzzy distance measures between a 
speech input signal representation matrix and respec- 
tive fuzzy matrix codebooks. 

5. The speech recognition system of claim 1 wherein the 
second speech signal preprocessor comprises: 

a plurality of hidden Markov models each modeling a 
respective word in a predetermined vocabulary, 
wherein the second input data representing the speech 
input signal is an observation sequence produced by the 
first speech signal preprocessor; and 

a probability module to determine respective probabilities 
of each hidden Markov model producing the observa- 
tion sequence representing the speech input signal. 

6. The speech recognition system of claim 5 wherein the 
probability module includes a Viterbi algorithm. 

7. The speech recognition system of claim 1 wherein the 
first input data representing the speech input signal com-
prises X order line spectral pair coefficients.

8. The speech recognition system of claim 1 wherein the 
speech classifier is a multilevel perceptron neural network. 

9. The speech recognition system of claim 1 wherein the 
selected mix of the first and second speech input signal 
preclassifying output data is selected from the group com- 
prised of 
(i) the first speech input signal preclassifying output data
alone,

(ii) the second speech input signal preclassifying output
data alone,

(iii) a combination of the first and second speech input
signal preclassifying output data,

(iv) the first speech input signal preclassifying output data
and the second speech input signal preclassifying out-
put data,

(v) the first speech input signal preclassifying output data
and the combination of the first and second speech
input signal preclassifying output data,

(vi) the second speech input signal preclassifying output
data and the combination of the first and second speech
input signal preclassifying output data, and

(vii) the first speech input signal preclassifying output
data, the combination of the first and second speech
input signal preclassifying output data, and the second
speech input signal preclassifying output data.

10. The speech recognition system of claim 1 wherein the 
first speech input signal preclassifying output data is fuzzy 
distance measures between the first input data representing 
the speech input signal and respective reference codebooks
of the first speech signal preprocessor.

11. The speech recognition system of claim 1 further 
comprising: 

decision logic coupled to the speech classifier to receive 
the output data from the speech classifier and to classify 
the speech input signal as a word selected from a
predetermined vocabulary. 

12. The speech recognition system of claim 1 further
comprising: 

a processor; 

a memory coupled to the processor and having processor
executable code for implementing the first and second 
speech signal preprocessors, the mixer and the speech 
classifier. 

13. The speech recognition system of claim 1 wherein the 
selection control circuit is capable of determining an appro-
priate balance between the speech recognition accuracy of
the speech recognition system and the speech recognition 
processing speed of the speech recognition system in accor- 
dance with factors affecting speech recognition accuracy and 
speech recognition processing speed, wherein such factors
are selected from the group comprising a vocabulary size of 
the speech recognition system and noise levels of an envi- 
ronment of the speech recognition system. 

14. A speech recognition system comprising: 

a speech input signal feature extractor to provide param- 
eters representing features of T groups of N speech 
input signal frames; 

a vocabulary of u words; 

a matrix quantizer to receive the parameters and to
provide (i) a series of observation sequences for each of 
the T groups of the N speech input signal frames and 
(ii) distance measure output data between the param- 
eters and u respective matrix codebooks; 

a plurality of u hidden Markov models coupled to the
matrix quantizer to receive the observation sequences; 

a Viterbi algorithm module to receive the observation 
sequences and provide respective probabilities that the 
respective hidden Markov models produced a respec-
tive observation sequence;

a selection control circuit to determine when the distance 
measure output, the probabilities, and a combination of 
the distance measure output and the probabilities are
included in a plurality of selected mixes by determining 
an appropriate balance between speech recognition 
accuracy of the speech recognition system and a speech 
recognition processing speed of the speech recognition 
system; 

a mixer coupled to the matrix quantizer and the Viterbi 
algorithm module for mixing the distance measure 
output and the probabilities into one set of mixed 
output data based on the selected mixes; and 

a neural network coupled to the mixer to receive the 
mixed output data set and determine which of the u 
vocabulary words most probably represents the speech 
input signal. 

15. The speech recognition system of claim 14 wherein 
the matrix quantizer is a fuzzy matrix quantizer, the distance 
measures are fuzzy distance measures, and the observation 
sequence is a vector of indices representing the relative 
closeness of each of the parameters and codewords in the 
respective matrix codebooks. 

16. The speech recognition system of claim 14 wherein 
the predetermined mixed output data sets include: 

(i) the distance measure output preclassifying output data 
alone, 

(ii) the probabilities preclassifying output data alone, 

(iii) a combination of the distance measure output and 
probabilities preclassifying output data, 

(iv) the distance measure output preclassifying output 
data and the probabilities preclassifying output data, 

(v) the distance measure output preclassifying output data 
and the combination of the distance measure output and 
probabilities preclassifying output data, 

(vi) the probabilities preclassifying output data and the 
combination of the distance measure output and prob- 
abilities preclassifying output data, and 

(vii) the distance measure output preclassifying output 
data, the combination of the distance measure output 
and probabilities preclassifying output data, and the 
probabilities preclassifying output data. 

17. The speech recognition system of claim 14 wherein 
the speech input signal feature extractor comprises: 

an X order linear predictive code (LPC) module to
determine X LPC coefficients; and

a line spectral pair (LSP) module to determine X LSPs
from the X LPC coefficients.

18. The speech recognition system of claim 14 wherein 
the selection control circuit is capable of determining an 
appropriate balance between the speech recognition accu-
racy of the speech recognition system and the speech
recognition processing speed of the speech recognition 
system in accordance with factors affecting speech recog- 
nition accuracy and speech recognition processing speed, 
wherein such factors are selected from the group comprising 
a vocabulary size of the speech recognition system and noise 
levels of an environment of the speech recognition system. 

19. The speech recognition system of claim 14 further 
comprising a noise level detector to provide a noise level 
parameter output signal to the selection control circuit. 

20. A speech recognition system comprising: 

means for processing first speech input signal data to 
preclassify the speech input signal and produce first 
preclassification output data, wherein the first speech 
input signal data represents a speech input signal; 

means for processing second speech input signal data to 
preclassify the speech input signal and produce second 
preclassification output data; 




means, coupled to both means for processing, for deter- 
mining when to include the first speech input signal, the 
second speech input signal, and a combination of the 
first and second speech input signals in a preferred mix 
of the preclassification output data by determining an 
appropriate balance between speech recognition accu- 
racy of the speech recognition system and a speech 
recognition processing speed of the speech recognition 
system; 

means, coupled to the means for determining, for mixing 
the first and second preclassification output data in 
accordance with the determined preferred mix; 

means, coupled to the means for mixing, for classifying 
the speech input signal based on the preferred mix of 
preclassification output data. 

21. The speech recognition system of claim 20 further
comprising means to provide a noise level parameter output 
signal to the means for determining. 

22. A speech recognition method comprising the steps of:

processing first speech input signal data to preclassify the
speech input signal and produce first preclassification
output data, wherein the first speech input signal data 
represents a speech input signal; 

processing second speech input signal data to preclassify 
the speech input signal and produce second preclassi- 
fication output data; 

determining when to include the first speech input signal, 
the second speech input signal, and a combination of 
the first and second speech input signals in a preferred 
mix of the preclassification output data by determining 
at least an appropriate balance between speech recog- 
nition accuracy and a speech recognition processing 
speed; 

mixing the first and second preclassification output data in
accordance with the preferred mix; and

classifying the speech input signal based on the preferred
mix of preclassification output data.

23. The speech recognition method of claim 22 wherein 
step of processing first speech input signal data comprises 
the step of: 

fuzzy matrix quantizing a plurality of the first speech 
input signal data; 

determining a fuzzy distance measure between the fuzzy 
matrix quantized first speech input signal data and a 
plurality of fuzzy matrix codebooks, wherein the first 
preclassification output data includes the fuzzy distance 
measure. 

24. The speech recognition method of claim 22 further 
comprising the steps of: 

training a first speech processor for processing the first 
speech input signal data with temporally related data 
from speech input signals corrupted with acoustic noise 
at a plurality of signal to noise ratios; 

training a second speech processor for processing the 
second speech input signal data with temporally related 
data from the speech input signals corrupted with the 
acoustic noise at the plurality of signal to noise ratios; 
and 

training a speech classifier to classify the speech input 
signal with a plurality of preclassification output data 
mixes. 

25. The speech recognition method of claim 22 wherein 
the processing first speech input signal data step further 
comprises the step of: 

determining an observation sequence of indices represent- 
ing a relative closeness between the first speech input 
signal data and a plurality of codebooks. 
26. The speech recognition method of claim 22 further
comprising the steps of:

receiving TO speech input signals, wherein the TO speech
input signals define an input speech word;

representing each of the TO speech input signals with P
LSP coefficients;

representing each group of N frames of the speech input
signals with a respective PxN matrix;

determining the relative closeness between each PxN
matrix and each codeword in a fuzzy matrix codebook,
wherein an observation sequence vector of indices is
produced for each PxN matrix, and the indices are the
second speech input signal data;

determining a distance between each PxN matrix and
each of the codewords; and

weighting the distance between each PxN matrix and each
of the codewords with respective indices of the obser-
vation sequence vector corresponding to the respective
PxN matrix to obtain an overall fuzzy distance
measure, wherein the fuzzy distance measure is the first
preclassification output data.

27. The speech recognition method of claim 22 wherein 
the step of determining the preferred mix of the preclassi-
fication output data comprises the steps of:

selecting a mix of the preclassification output data to 
obtain a predetermined satisfactory recognition accu- 
racy in the least amount of time. 

28. The speech recognition method of claim 27 wherein 
30 the preferred mix is selected from the group comprising 

(i) the first speech input signal preclassifying output data 
alone, 

(ii) the second speech input signal preclassifying output 
35 data alone, 

(iii) a combination of the first and second speech input 
signal preclassifying output data, 

(iv) the first speech input signal preclassifying output data 
and the second speech input signal preclassifying out- 

40 put data, 

(v) the first speech input signal preclassifying output data 
and the combination of the first and second speech 
input signal preclassifying output data, 

45 (vi) the second speech input signal preclassifying output 
data and the combination of the first and second speech 
input signal preclassifying output data, and 
(vii) the first speech input signal preclassifying output 
data, the combination of the first and second speech 
input signal preclassifying output data, and the second 
speech input signal preclassifying output data. 
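
The seven alternatives (i) through (vii) are the non-empty subsets of three data sources: the first output data, the second output data, and their combination. The sketch below enumerates them in the same order and picks the fastest mix that reaches a required accuracy, as in claim 27. The source labels, the `measurements` mapping, and the function names are hypothetical; accuracy and time figures would have to come from actual measurement.

```python
from itertools import combinations
from typing import Dict, List, Optional, Tuple

# Hypothetical labels for the three preclassification data sources.
SOURCES = ("first", "second", "combination")

Mix = Tuple[str, ...]


def candidate_mixes() -> List[Mix]:
    """All non-empty subsets of SOURCES, in the order (i) through (vii)."""
    return [mix
            for size in range(1, len(SOURCES) + 1)
            for mix in combinations(SOURCES, size)]


def select_mix(measurements: Dict[Mix, Tuple[float, float]],
               target_accuracy: float) -> Optional[Mix]:
    """Return the mix with the shortest processing time among those whose
    measured accuracy meets the target, or None if no mix qualifies.

    `measurements` maps each mix to (accuracy, processing_time).
    """
    qualifying = [(seconds, mix)
                  for mix, (accuracy, seconds) in measurements.items()
                  if accuracy >= target_accuracy]
    if not qualifying:
        return None
    return min(qualifying)[1]
```

Enumerating the subsets rather than hard-coding them keeps the list consistent if a further data source were ever added.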

29. The speech recognition method of claim 22 wherein 
second speech input signal data is an observation sequence 
of indices of relative closeness of a representation of the 
speech input signal to codewords in a reference codebook, 
and the step of processing second speech input signal data 
comprises the step of: 

determining with a fuzzy Viterbi algorithm a respective 
probability for each of u hidden Markov models that 
the hidden Markov model produced the observation 
sequence, wherein the second preclassification output 
data are the u determined respective probabilities. 
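
Scoring one observation sequence against each of the u hidden Markov models can be illustrated as below. Note the substitution: this sketch uses the standard Viterbi recursion over discrete observation symbols, computed in the log domain, and does not reproduce the fuzzy variant recited in the claim. The model layout `(start, trans, emit)` and the function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

# Assumed model layout: (start probabilities, transition matrix, emission matrix).
Model = Tuple[Sequence[float], Sequence[Sequence[float]], Sequence[Sequence[float]]]


def _log(p: float) -> float:
    """Natural log that maps probability 0 to negative infinity."""
    return math.log(p) if p > 0.0 else float("-inf")


def viterbi_log_score(obs: Sequence[int],
                      start: Sequence[float],
                      trans: Sequence[Sequence[float]],
                      emit: Sequence[Sequence[float]]) -> float:
    """Log-probability of the single most likely state path that emits `obs`."""
    if not obs:
        raise ValueError("observation sequence must not be empty")
    n = len(start)
    # Initialise with the first observation.
    delta = [_log(start[s]) + _log(emit[s][obs[0]]) for s in range(n)]
    # Extend the best path to each state, one observation at a time.
    for symbol in obs[1:]:
        delta = [max(delta[r] + _log(trans[r][s]) for r in range(n))
                 + _log(emit[s][symbol])
                 for s in range(n)]
    return max(delta)


def score_models(obs: Sequence[int], models: Sequence[Model]) -> List[float]:
    """One score per model; these play the role of the u respective probabilities."""
    return [viterbi_log_score(obs, *model) for model in models]
```

Working in the log domain avoids numerical underflow on long observation sequences; the resulting u scores can be passed on in place of raw probabilities.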

30. The speech recognition method of claim 22 wherein 
the step of classifying the speech input signal comprises the 
step of: 

classifying the speech input signal with a multilayer 
perceptron neural network. 

31. The speech recognition method of claim 22 wherein 
determining an appropriate balance between the speech 
recognition accuracy and the speech recognition processing 
speed comprises utilizing factors affecting speech recogni- 
tion accuracy and speech recognition processing speed, 
wherein such factors are selected from the group comprising 
a vocabulary size and noise levels of an environment. 

32. A speech recognition system comprising: 

a first speech signal preprocessor to receive first input data 
representing a speech input signal and having first 
speech input signal preclassifying output data; 

a second speech signal preprocessor to receive second 
input data representing the speech input signal and 
having second speech input signal preclassifying out- 
put data; 

a mixer to receive the first and second speech input signal 
preclassifying output data and having output data rep- 
resented by a selected mix of the first and second 
speech input signal preclassifying output data; 

a non-neural network selection control circuit coupled to 
the mixer to determine when to include the first speech 
input signal, the second speech input signal, and a 
combination of the first and second speech input signals 
in the selected mix; and 

a speech classifier to receive the selected mix and having 
output data to classify the speech input signal as 
recognized speech. 

33. A speech recognition system comprising: 

a first speech signal preprocessor to receive first input data 
representing a speech input signal and having first 
speech input signal preclassifying output data; 

a second speech signal preprocessor to receive second 
input data representing the speech input signal and 
having second speech input signal preclassifying out- 
put data; 

a mixer to receive the first and second speech input signal 
preclassifying output data and having output data rep- 
resented by a selected mix of the first and second 
speech input signal preclassifying output data; 

a selection control circuit coupled to the mixer to deter- 
mine when to include the first speech input signal, the 
second speech input signal, and a combination of the 
first and second speech input signals in the selected 
mix; 

a speech classifier to receive the selected mix and having 
output data to classify the speech input signal as 
recognized speech; and 

a noise level detector to provide a noise level parameter 
output signal to the selection control circuit. 

34. The speech recognition system of claim 33 wherein 
the noise level detector comprises a noise level detection 
sensor to detect noise levels which may corrupt at least one 
of the first input data and the second input data. 

35. The speech recognition system of claim 33 wherein 
the noise level detector comprises: 

a database of noise level information corresponding to 
noise levels at different traveling speeds of a vehicle; 
and 

a data retriever to retrieve noise level information from 
the database of noise level information corresponding 
to a traveling speed of the vehicle. 
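
A database of noise levels keyed by vehicle speed, together with a data retriever, might look like the sketch below. The class name, the step-wise lookup rule (use the entry for the highest tabulated speed not exceeding the current speed), and the units are all assumptions; the claim does not specify how the stored information is organized.

```python
from bisect import bisect_right
from typing import Sequence, Tuple


class NoiseLevelDatabase:
    """Hypothetical table of (traveling speed, noise level) entries."""

    def __init__(self, entries: Sequence[Tuple[float, float]]) -> None:
        if not entries:
            raise ValueError("at least one (speed, noise level) entry is required")
        ordered = sorted(entries)
        self._speeds = [speed for speed, _ in ordered]
        self._levels = [level for _, level in ordered]

    def retrieve(self, speed: float) -> float:
        """Return the noise level stored for the highest tabulated speed that
        does not exceed `speed`; below the table, return the lowest entry."""
        index = bisect_right(self._speeds, speed) - 1
        return self._levels[max(index, 0)]
```

The retrieved value would serve as the noise level parameter supplied to the selection control circuit, for example to favor a different mix at higher traveling speeds.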